feat: datumctl compute plugin — deploy and manage workloads from the CLI by scotwells · Pull Request #113 · datum-cloud/compute

scotwells · 2026-05-22T21:50:21Z

Summary

Adds the datumctl compute plugin so developers can deploy and manage containerized workloads on Datum Cloud directly from the CLI.

Commands shipped:

deploy — push a container image as a workload with flags or a manifest file; waits for rollout
destroy — tear down a workload with a confirmation prompt
status — show workload health, per-city placement summary, and the active revision
instances — list all running instances across cities, with describe for full detail
scale — adjust minimum replica count across all placements
rollout — watch live rollout progress, browse revision history, and roll back to any prior revision
restart — trigger a rolling restart of a workload or a specific city
quota — inspect per-city instance usage and surface quota-exceeded messages

Revision history is stored as a ConfigMap per workload so rollout history and rollout undo work without server-side tracking.

Dependencies

Depends on feat: extend datumctl with installable service plugins datumctl#198 (plugin dispatch foundation) — go.mod currently uses a replace directive pointing at that PR's worktree; the directive should be removed and replaced with a release tag once that PR merges.

What's not included

logs — telemetry service not yet implemented
Tests — next step is adding envtest-based integration tests for each command
cities / instance-types resource listing commands

…cheduling base

After rebasing onto feat/federated-deployment-scheduling, go.mod had picked up the wrong versions of two deps via conflict resolution:

- go.datum.net/network-services-operator was left at v0.1.0 (from #113's old go.mod side) instead of v0.21.10-... required by HEAD's LocationBinding usage
- go.miloapis.com/service-catalog v0.0.0-20260527221104 transitively requires milo v0.26.1, which has a broken downstreamclient (Apply method missing, ClusterName type mismatch).

Add a replace directive to pin milo to v0.25.2 (the version used by the federated-scheduling base) so downstreamclient compiles cleanly. service-catalog is updated to the latest available version.

Also apply gofmt alignment fixes surfaced by the rebase on instance_controller.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds the datumctl-compute plugin binary with commands for deploying and managing containerized workloads on Datum Cloud via the developer CLI.

Commands:

- deploy: create or update a workload from flags or a manifest file
- destroy: delete a workload and clean up its revision history
- status: show health, placement summary, and recent revision info
- instances: list and describe running instances across cities
- scale: adjust minimum replica count across placements
- rollout: watch live progress, view history, and roll back revisions
- restart: trigger a rolling restart of a workload or specific city
- quota: inspect per-city instance usage and quota headroom

Closes #98. Depends on datum-cloud/datumctl#198.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Within a project's virtual control plane, all resources live in the "default" namespace — the project slug is only used to route to the right control plane URL. Updated all commands to use util.ResourceNamespace ("default") instead of the project name as the k8s namespace. Also corrects the instance type default from "d1-standard-2" to "datumcloud/d1-standard-2" to match the format the admission webhook requires. Discovered while testing against the staging environment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The datumctl module requirement was upgrading controller-runtime to v0.23.3, which broke compatibility with multicluster-runtime and milo. Eliminated the dependency by:

- Inlining the --plugin-manifest protocol in main.go
- Reading DATUM_API_HOST and DATUM_CREDENTIALS_HELPER from env directly in util/client.go instead of via plugin.Context()/plugin.Token()
- Reading DATUM_ORG from env in root.go instead of via plugin.NewRootCmd
- Dropping the now-unreachable internal/cmd/compute/client.go

Also updates CI workflows to use go-version-file instead of a pinned go 1.24.0, and bumps golangci-lint to v2.12.2, which supports go 1.25.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
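Reading the host-provided settings straight from the environment could look roughly like the sketch below. The three variable names come from the commit message; the `pluginEnv` struct, the `loadPluginEnv` helper, and the choice of which variables are mandatory are assumptions made for illustration, not the plugin's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// pluginEnv is a hypothetical container for the values the host CLI
// passes to the plugin through environment variables.
type pluginEnv struct {
	APIHost           string
	CredentialsHelper string
	Org               string
}

// loadPluginEnv reads the settings through the supplied lookup function
// (os.Getenv in production, a stub in tests). Treating the API host and
// credentials helper as required is an assumption of this sketch.
func loadPluginEnv(getenv func(string) string) (pluginEnv, error) {
	env := pluginEnv{
		APIHost:           getenv("DATUM_API_HOST"),
		CredentialsHelper: getenv("DATUM_CREDENTIALS_HELPER"),
		Org:               getenv("DATUM_ORG"),
	}
	if env.APIHost == "" {
		return pluginEnv{}, errors.New("DATUM_API_HOST is not set")
	}
	if env.CredentialsHelper == "" {
		return pluginEnv{}, errors.New("DATUM_CREDENTIALS_HELPER is not set")
	}
	return env, nil
}

func main() {
	env, err := loadPluginEnv(os.Getenv)
	if err != nil {
		fmt.Println("configuration error:", err)
		return
	}
	fmt.Println("API host:", env.APIHost, "org:", env.Org)
}
```

Passing the lookup function in, rather than calling `os.Getenv` inside the helper, keeps the parsing testable without mutating the process environment.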

Upgrades controller-runtime from v0.21.0 to v0.23.3 and multicluster-runtime from v0.21.0-alpha.8 to v0.23.3, which unblocks adding go.datum.net/datumctl as a direct dependency.

The CLI plugin (datumctl-compute) now uses the official datumctl plugin SDK:

- plugin.ServeManifest() for the --plugin-manifest protocol
- plugin.NewRootCmd() for pre-wired org/project/output flags
- plugin.Context() and plugin.Token() for credential access

Controller breaking changes addressed: ClusterName distinct type, Watches callback signature, NewWebhookManagedBy generic API. A local milo provider fork is added at internal/provider/milo since the upstream package hasn't been updated for the ClusterName type change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Addresses 63 lint findings across errcheck, goconst, gocyclo, gofmt, prealloc, staticcheck, and unparam linters:

- gofmt/goimports: reformat cmd/main.go, deploy.go, util/client.go, webhook
- errcheck: assign discarded fmt.Fprint* and Flush returns to _
- staticcheck: update webhook to generic admission.Defaulter[T]/Validator[T] with WithDefaulter/WithValidator; fix SA4010 unused append in quota.go; remove redundant .ObjectMeta selectors in restart.go
- unparam: rename four never-used function parameters to _
- gocyclo: extract helpers from watch.Rollout and quota.runQuota to reduce cyclomatic complexity below threshold
- goconst: extract repeated string literals to named constants across controllers, validation, and tests
- prealloc: preallocate slices with known capacity in validation and tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- errcheck: fix unchecked fmt.Fprint* returns in deploy, quota, rollout, scale
- prealloc: preallocate allErrs in workload_validation.go and stateful test
- gofmt: reformat destroy.go, instances.go, rollout.go

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- golangci.yml: exclude errcheck for internal/cmd/*, since ignoring write errors on stdout/stderr is idiomatic in CLI tools
- prealloc: preallocate allErrs in validateScaleSettingMetrics
- gofmt: reformat status.go, instance_controller_test.go

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Wire ValidArgsFunction on every command that accepts a workload name (deploy, destroy, restart, rollout, rollout history, rollout undo, scale, status) and register flag completion for instances --workload. All completions call a shared CompleteWorkloadNames helper in internal/cmd/compute/util that fetches live workload names from the API and always returns ShellCompDirectiveNoFileComp so the shell never falls back to filename completion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Remove ValidArgsFunction from deploy and replace with util.CompleteWorkloadNamesAndFlags, which wraps CompleteWorkloadNames with plugin.WithFlagCompletion from the datumctl SDK.
- Add plugin.WithFlagCompletion to the datumctl plugin SDK so any plugin can get the same behaviour by wrapping their own ValidArgsFunction.
- Bump go.datum.net/datumctl to b44de1c (adds WithFlagCompletion).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove the hardcoded datum-control-plane ClusterIssuer from the csi-webhook-cert component. DNS names stay since they are fixed by the service name and namespace. Each consuming overlay now supplies the issuer via a strategic merge patch, allowing different environments to use different cert issuers without forking the component. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The cert issuer name is environment-specific configuration that belongs in the infra repo, not the compute overlay. The infra repo's base manager patch already owns the full webhook-server-tls volume definition including the issuer. Consumers deploying outside infra must patch the issuer in their own overlay. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add a printer.go with PrintJSON and PrintYAML helpers that commands can use to emit API resources as structured output. Extend completion.go with CompleteInstanceNames, CompleteCityCodes, and CompleteOutputFormats so all -o/--output, --city, and instance-name completions are driven from a single shared source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Both commands now accept -o/--output with tab-completion. json/yaml emit the underlying API resource (InstanceList) or structured quota rows respectively. wide adds an INSTANCE TYPE column for instances. --no-headers suppresses the header row for table and wide. City completion is wired to CompleteCityCodes and instance describe gains tab-completion via CompleteInstanceNames. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add datumctl compute workloads (list) and workloads describe <name> commands. The list command shows NAME/HEALTH/READY/PLACEMENTS/IMAGE/AGE columns with --health and --city filters, -o table|wide|json|yaml, and a footer summary. The describe command replaces status with a unified config+health view: header block, per-placement per-city ready counts with inline degradation annotations, and a container spec block. Remove the now-redundant status command from root.go and delete its package. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Fix duplicate TYPE/INSTANCE TYPE columns in instances -o wide (W3): populate TYPE from runtimeKind (sandbox/vm), INSTANCE TYPE from instType
- Fix footer bucketing in instances list (W4): compute Running/Pending/Failed from actual status strings instead of hardcoding Failed=0
- Skip revision ConfigMap Gets in workloads list table mode (W5): only fetch per-workload revision when -o wide is requested, avoiding N round-trips on every list invocation
- Compute health footer tallies after filters are applied (W9): previously counted all workloads then printed a filtered subset, making the summary misleading when --health or --city filters were active
- Fix gofmt import ordering in workloads.go (B1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
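The W9 fix (tally after filtering) can be illustrated with a small sketch. The `workload` struct, the `filterAndTally` helper, and the health values used here are hypothetical stand-ins; the real command works on API resources rather than this simplified type.

```go
package main

import "fmt"

// workload is a simplified, hypothetical stand-in for the listed resource.
type workload struct {
	Name   string
	Health string
	Cities []string
}

// matches reports whether w passes the --health and --city filters.
// An empty filter value means "no filter".
func matches(w workload, health, city string) bool {
	if health != "" && w.Health != health {
		return false
	}
	if city == "" {
		return true
	}
	for _, c := range w.Cities {
		if c == city {
			return true
		}
	}
	return false
}

// filterAndTally applies the filters first and counts health values only
// for the rows that will actually be printed, so the footer matches the table.
func filterAndTally(all []workload, health, city string) ([]workload, map[string]int) {
	shown := make([]workload, 0, len(all))
	tally := map[string]int{}
	for _, w := range all {
		if !matches(w, health, city) {
			continue
		}
		shown = append(shown, w)
		tally[w.Health]++
	}
	return shown, tally
}

func main() {
	all := []workload{
		{Name: "api", Health: "Healthy", Cities: []string{"DFW", "LHR"}},
		{Name: "web", Health: "Degraded", Cities: []string{"DFW"}},
		{Name: "job", Health: "Healthy", Cities: []string{"LHR"}},
	}
	shown, tally := filterAndTally(all, "", "DFW")
	fmt.Println(len(shown), "shown;", tally)
}
```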

Before creating a workload, the deploy command now checks whether the required network(s) exist. If a network is missing, the user is offered the option to create a minimal auto-IPAM network in-place rather than hitting an opaque NetworkNotFound error post-submission. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… API

- Add EnsureComputeEntitlement to gate all compute commands on an active service entitlement; prompts TTY users to request access and surfaces approval status
- Rewrite quota command to query AllowanceBucket resources from the project VCP (milo-system namespace) instead of deriving usage from instance quota conditions
- Add NewPlatformClient targeting the platform API server for ResourceRegistration lookups
- Extract ListServiceQuota into util so other service plugins can reuse the quota display logic with their own resource type prefix and display metadata overrides

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace hand-rolled HTTP entitlement code with a proper client-go implementation using go.miloapis.com/service-catalog types. Uses client.WithWatch to stream events from the API server and unblocks as soon as the Ready condition appears — no polling interval. Also adds ASCII progress bar to quota table output. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
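An ASCII progress bar for a quota row could be rendered along these lines. This is only a sketch: the `progressBar` name, the `#`/`-` glyphs, and the bracketed layout are assumptions, and the actual table output may look different.

```go
package main

import (
	"fmt"
	"strings"
)

// progressBar renders used/limit as a fixed-width bar such as "[###-------]".
// A non-positive limit or usage yields an empty bar, and usage above the
// limit is clamped to a full bar.
func progressBar(used, limit int64, width int) string {
	if width <= 0 {
		return "[]"
	}
	filled := 0
	if limit > 0 && used > 0 {
		filled = int(used * int64(width) / limit)
		if filled > width {
			filled = width
		}
	}
	return "[" + strings.Repeat("#", filled) + strings.Repeat("-", width-filled) + "]"
}

func main() {
	fmt.Println(progressBar(3, 10, 10), "3/10")
}
```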

The compute CLI client now serializes network-services-operator types (Network, NetworkBinding, SubnetClaim), so deploy can preflight and create networks on the user's behalf. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Deployment revisions are becoming a platform concept rather than a client concern. Remove the ConfigMap-backed revision ledger the CLI maintained per workload, along with the 'rollout history' and 'rollout undo' subcommands and the revision column in 'workloads'. 'rollout' remains as a live-progress watch. This also removes the only code path that serialized core/v1 ConfigMaps from the CLI, so the missing-corev1-scheme warning on deploy no longer occurs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… resolution The first conflict resolution in the aa9dc15 commit accidentally truncated workload_webhook.go, dropping the ValidateCreate method, its kubebuilder marker, and producing a syntactically invalid Default function body (extra brace + wrong return signature). Restore the file to match 5486adf's content (the authoritative post-lint-migration version). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The platform now stamps city-code, workload-name, workload-deployment-name, and placement-name directly onto Instances at creation time. The CLI can therefore resolve CITY/WORKLOAD/placement directly from those labels without performing cross-object joins. The prior approach keyed the WorkloadDeployment map on UID and looked up instances via WorkloadDeploymentUIDLabel. That UID is the edge/Karmada WD UID, which differs from the project-cluster WD UID, causing the join to fail across federation planes and producing "unknown"/"orphaned" output. The new label-first path reads CityCodeLabel, WorkloadNameLabel, PlacementNameLabel, and WorkloadDeploymentNameLabel (name is identical across all planes) before falling back to the WD Get/List join. A wdNameFromInstanceName helper strips the trailing ordinal suffix from the Instance name as a last-resort fallback for instances created before the labels existed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
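A minimal sketch of the label-first lookup with the name-based last resort follows. The label key is a placeholder, the assumption that the ordinal suffix has the form `-<digits>` is mine, and the intermediate WorkloadDeployment Get/List join described above is omitted for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// workloadDeploymentNameLabel is a placeholder key; the real constant
// (WorkloadDeploymentNameLabel) lives in the API types.
const workloadDeploymentNameLabel = "example.test/workload-deployment-name"

// wdNameFromInstanceName strips a trailing "-<digits>" ordinal from an
// Instance name. Names without such a suffix are returned unchanged.
func wdNameFromInstanceName(name string) string {
	i := strings.LastIndex(name, "-")
	if i <= 0 || i == len(name)-1 {
		return name
	}
	for _, r := range name[i+1:] {
		if r < '0' || r > '9' {
			return name
		}
	}
	return name[:i]
}

// resolveWDName prefers the label stamped at creation time and falls back
// to parsing the Instance name for instances that predate the labels.
func resolveWDName(labels map[string]string, instanceName string) string {
	if v := labels[workloadDeploymentNameLabel]; v != "" {
		return v
	}
	return wdNameFromInstanceName(instanceName)
}

func main() {
	fmt.Println(resolveWDName(nil, "web-us-east-0"))
	fmt.Println(resolveWDName(map[string]string{workloadDeploymentNameLabel: "api-dfw"}, "anything-3"))
}
```

Because the deployment name is identical across planes, a name-keyed lookup avoids the UID mismatch that broke the earlier join.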

The `compute deploy` rollout watcher reported PHASE=Done and exited within seconds of creating the workload, before any instances were scheduled. A WorkloadDeployment's Status.DesiredReplicas stays at zero until the controller first reconciles it, and computePhase treated zero desired as Done — so the very first poll of a fresh deployment looked complete. Resolve the wait target from the spec minimum while the controller has not yet reported a desired count, and require that no stale replicas remain before reporting Done so scale-downs and rolling updates aren't declared complete while old instances are still draining. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
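The corrected phase logic might be sketched as below. The `rolloutState` struct and its field names are hypothetical, as is the `Progressing` phase name, and using a zero status count as the signal that the controller has not yet reconciled is an assumption of this sketch rather than a confirmed detail of the fix.

```go
package main

import "fmt"

// rolloutState is a hypothetical snapshot of what the watcher polls.
type rolloutState struct {
	SpecMinReplicas int32 // minimum replicas requested in the spec
	DesiredReplicas int32 // status field; zero until the first reconcile
	ReadyReplicas   int32 // up-to-date replicas that are ready
	StaleReplicas   int32 // old replicas that are still draining
}

// waitTarget uses the controller-reported desired count once it is set and
// otherwise falls back to the spec minimum.
func waitTarget(s rolloutState) int32 {
	if s.DesiredReplicas > 0 {
		return s.DesiredReplicas
	}
	return s.SpecMinReplicas
}

// computePhase reports "Done" only when the target is met and no stale
// replicas remain, so fresh deployments and rolling updates keep waiting.
func computePhase(s rolloutState) string {
	if s.ReadyReplicas >= waitTarget(s) && s.StaleReplicas == 0 {
		return "Done"
	}
	return "Progressing"
}

func main() {
	fresh := rolloutState{SpecMinReplicas: 2}
	fmt.Println("fresh deployment:", computePhase(fresh))
	settled := rolloutState{SpecMinReplicas: 2, DesiredReplicas: 2, ReadyReplicas: 2}
	fmt.Println("settled deployment:", computePhase(settled))
}
```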

…l-compute-plugin

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Consume the server-side status-blocking-reason contract: each resource's readiness condition (Instance/Ready, WorkloadDeployment/Available, Workload/Available) now carries a machine-readable reason and human message when not True.

- Add ReadinessBlock helper in util/conditions.go: given a condition list and type, returns (reason, message, blocked) with no per-reason branching. It is the single reusable entry point for the new contract.
- InstanceStatus (list view): falls through to "Pending (<reason>)" from the Ready condition when no specific sub-condition check matches, replacing the bare "Pending" for unknown causes like SourceNotFound or ReferencedDataNotReady.
- InstanceStatusDetail (describe view): falls through to "Pending — <reason>" with the message as detail, replacing "Unknown" for those same causes.
- WorkloadHealth: surfaces the reason from Available when false, e.g. "Unavailable — SourceNotFound" instead of the generic message.
- degradedAnnotation (workloads describe per-city line): rewritten to read the WorkloadDeployment's own Available condition; removes the per-instance List fetch and the quota/InstanceStatusDetail special-casing that was its only logic.
- printBlockedDetail (rollout watch): rewritten to read the deployment's Available condition; removes the per-instance List fetch entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
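A helper with the shape described for ReadinessBlock could look like the following sketch. It uses a local `condition` struct instead of the Kubernetes `metav1.Condition` type so that it stays self-contained, and treating a missing condition as "not blocked" is an assumption; the real helper may handle that case differently.

```go
package main

import "fmt"

// condition mirrors the fields of a Kubernetes status condition that the
// sketch needs; it is a local stand-in, not the real API type.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// readinessBlock returns the reason and message of the named condition when
// it is present and not "True". There is no per-reason branching.
func readinessBlock(conds []condition, condType string) (reason, message string, blocked bool) {
	for _, c := range conds {
		if c.Type != condType {
			continue
		}
		if c.Status == "True" {
			return "", "", false
		}
		return c.Reason, c.Message, true
	}
	return "", "", false
}

// listStatus shows the list-view fallback: "Pending (<reason>)" when the
// Ready condition is blocked. "Running" for the unblocked case is assumed.
func listStatus(conds []condition) string {
	reason, _, blocked := readinessBlock(conds, "Ready")
	if !blocked {
		return "Running"
	}
	if reason == "" {
		return "Pending"
	}
	return "Pending (" + reason + ")"
}

func main() {
	conds := []condition{{Type: "Ready", Status: "False", Reason: "SourceNotFound", Message: "source missing"}}
	fmt.Println(listStatus(conds))
}
```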

…rovisioning status The Programmed condition starts as Unknown (not False) while programming is in progress, so the ConditionFalse-only checks were bypassed and the raw ProgrammingInProgress reason leaked through the Ready condition fallback. Widen the checks to status != True to cover both Unknown and False states. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
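The difference between the old and widened checks is small enough to show directly. Both function names below are hypothetical; the status strings are the standard Kubernetes condition values `True`, `False`, and `Unknown`.

```go
package main

import "fmt"

// blockedFalseOnly is the narrower check: it misses the Unknown status that
// the Programmed condition reports while programming is in progress.
func blockedFalseOnly(status string) bool {
	return status == "False"
}

// blockedNotTrue is the widened check: anything other than True counts as
// not yet programmed, covering both Unknown and False.
func blockedNotTrue(status string) bool {
	return status != "True"
}

func main() {
	for _, s := range []string{"True", "False", "Unknown"} {
		fmt.Printf("%-8s falseOnly=%v notTrue=%v\n", s, blockedFalseOnly(s), blockedNotTrue(s))
	}
}
```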

Add three provider-emitted reason constants to the API types and map them to plain-English STATUS strings in the list and describe views:

- ImageUnavailable → Failed (image unavailable)
- InstanceCrashing → Failed (crashing)
- ConfigurationError → Failed (configuration error)

Rename the PendingProgramming/ProgrammingInProgress cases from the misleading "network provisioning" to "Starting", which accurately describes the transient state without implying network work is involved.

Failed statuses are already counted in the "N Failed" summary line via the existing strings.HasPrefix(status, "Failed") check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
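The reason-to-status mapping and the prefix-based failure count can be sketched as follows. The reason names and status strings are taken from the commit message; the lookup-table layout, the `statusForReason` and `countFailed` names, and the `Pending (<reason>)` fallback for unmapped reasons are illustrative assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// reasonStatus maps machine-readable reasons to the STATUS column text.
var reasonStatus = map[string]string{
	"ImageUnavailable":      "Failed (image unavailable)",
	"InstanceCrashing":      "Failed (crashing)",
	"ConfigurationError":    "Failed (configuration error)",
	"PendingProgramming":    "Starting",
	"ProgrammingInProgress": "Starting",
}

// statusForReason returns the mapped status, or a generic pending status
// that still exposes the raw reason for anything unmapped.
func statusForReason(reason string) string {
	if s, ok := reasonStatus[reason]; ok {
		return s
	}
	return "Pending (" + reason + ")"
}

// countFailed counts statuses for the "N Failed" footer using the same
// prefix check the commit message mentions.
func countFailed(statuses []string) int {
	n := 0
	for _, s := range statuses {
		if strings.HasPrefix(s, "Failed") {
			n++
		}
	}
	return n
}

func main() {
	reasons := []string{"ImageUnavailable", "ProgrammingInProgress", "InstanceCrashing", "SourceNotFound"}
	statuses := make([]string, 0, len(reasons))
	for _, r := range reasons {
		statuses = append(statuses, statusForReason(r))
	}
	fmt.Println(statuses, countFailed(statuses), "Failed")
}
```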

scotwells · 2026-06-03T13:51:55Z

📋 Real-world UX issue from a user enabling compute

Heads up — we got a user report that surfaces a confusing first-run experience with the enablement flow, and I've traced it end-to-end via the staging audit logs. Sharing here since the fix touches this plugin.

What the user saw:

% datumctl compute instances list
Compute is not enabled for project "personal-project-153fe986".
Would you like to request access? [y/N]: y
Requesting access to compute for project "personal-project-153fe986"...
Error: requesting compute access: serviceentitlements.services.miloapis.com "compute" already exists

From their perspective this looks like a flat-out failure. In reality, their first attempt succeeded — compute was enabled.

What actually happened (from the audit trail):

First run created the entitlement successfully. ✅
But the backend takes a short while (~minutes in this case) to mark it Ready.
During that window, the CLI's "is compute enabled?" check keys off the entitlement's Ready status, not its existence — so it kept reporting "not enabled" and re-offering to request access.
Each retry tried to create the entitlement again and hit a 409 already exists, which we surfaced as a raw, scary error.

Why it matters for the product: the very first thing a new user does is turn compute on, and today that happy path can look broken even when it worked. The error message also leaks an internal resource name (serviceentitlements.services.miloapis.com) that means nothing to a user.

Proposed fix (branch fix/compute-entitlement-pending-state, built off this PR's branch): teach the enablement check to distinguish three states instead of two —

not requested → offer to request access (today's behavior)
requested but still activating → tell the user it's in progress and to try again in a moment (no re-prompt, no error)
active → proceed

…and treat a 409 already exists as "already requested, activation pending" rather than a fatal error. Net result: the user sees a calm "enablement in progress, hang tight" message instead of a stack of confusing failures.

Happy to fold this into this PR or send it as a follow-up — whichever you prefer. 🙏

scotwells mentioned this pull request May 22, 2026

Launch Datum Compute Service datum-cloud/enhancements#682

Open

scotwells changed the base branch from main to feat/federated-deployment-scheduling May 29, 2026 03:30

scotwells force-pushed the feat/datumctl-compute-plugin branch from a63c87a to c1186cb Compare May 29, 2026 03:30

scotwells and others added 23 commits May 28, 2026 22:33

fix: replace interface{} with any in printer helpers

886e8db

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

scotwells force-pushed the feat/datumctl-compute-plugin branch from c1186cb to 8c15212 Compare May 29, 2026 03:40

scotwells mentioned this pull request May 29, 2026

chore: remove milo v0.25.2 replace pin once service-catalog compatibility is resolved #123

Open

scotwells and others added 3 commits June 1, 2026 15:43

Merge branch 'feat/federated-deployment-scheduling' into feat/datumct…

f5a25aa

…l-compute-plugin

chore: ignore goreleaser dist output and local plugin binary

685e353

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scotwells force-pushed the feat/datumctl-compute-plugin branch from 8bc1efb to 685e353 Compare June 1, 2026 21:23

scotwells mentioned this pull request May 26, 2026

docs: propose datumctl compute developer experience #111

Merged

scotwells and others added 3 commits June 1, 2026 20:10

scotwells wants to merge 30 commits into feat/federated-deployment-scheduling from feat/datumctl-compute-plugin


1 participant
